In this analysis, we will analyze Formula 1 data from the year 2,000 and onward. The reason for this time frame is because of two things: the first begin that there no active drivers that began their F1 careers before the year 2,000, and the second being that, beginning in the year 2,000, it was mandated that all racecars have the same engine size, which would change overtime.
Furthermore, our analysis dives into the relationship between a driver’s starting position and the likelihood of obtaining a certain finishing result. One would assume that the better one qualifies, the higher the change of winning the race. Also, it is known that there are a few big teams/drivers who have dominated the sport over the years and continue to do so, thus providing relatively consistent results. One can easily assume that certain teams are much more likely to win a Grand Prix thus those teams are more statistically significant in determining the result of a race. Because of this, we also investigate the relationship between the results/points given the constructors (teams) and drivers.
Lastly, we create several prediction models that allow for predicting a race results. We use ONLY the parameters that one would have available just before a race: the team name, the driver name, the era (engine size) and the starting position.
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(ggplot2)
library(caret)
library(class)Starting Position Vs Race Result
Unfortunately,
due to lack of programming knowledge, we were unable to create the plot
that we wanted. However, we found exactly what we needed online and thus
decided to include that plot with the source tagged just below it.
From the plot, one can see that there does in-fact appear to be a relationship between a driver’s starting position. It is clear that the higher one starts on the grid, the higher the probability of winning the race. In addition these findings, we also noted that the lower one starts on the grid, the higher the probability of not finishing the race (DFN), which could make sense if one assumes that drivers towards the back are driving much less conservatively and thus more aggressive than those towards the front.
Source:
MrBookman_LibraryCop”(Reddit User)
Click
Here to Visit Source
Constructor/Driver Vs Race Result
F1DataFrame = read.csv("ModelingData/F1ModelingCustomCategoricalVariables.csv")
head(F1DataFrame)## X driverId constructorId circuitId raceId year name.x resultId
## 1 1 1 1 3 4 2009 Bahrain Grand Prix 7617
## 2 2 1 131 32 943 2015 Mexican Grand Prix 22858
## 3 3 1 131 14 892 2013 Italian Grand Prix 21962
## 4 4 1 131 35 894 2013 Korean Grand Prix 22002
## 5 5 1 131 73 992 2018 Azerbaijan Grand Prix 23842
## 6 6 1 131 4 884 2013 Spanish Grand Prix 21811
## grid positionOrder name.y alt name forename
## 1 5 4 Bahrain International Circuit 7 McLaren Lewis
## 2 2 2 Autódromo Hermanos Rodríguez 2227 Mercedes Lewis
## 3 12 9 Autodromo Nazionale di Monza 162 Mercedes Lewis
## 4 2 5 Korean International Circuit 0 Mercedes Lewis
## 5 2 1 Baku City Circuit -7 Mercedes Lewis
## 6 2 12 Circuit de Barcelona-Catalunya 109 Mercedes Lewis
## surname fullname era
## 1 Hamilton Lewis Hamilton 90 degrees V8
## 2 Hamilton Lewis Hamilton 90 degrees V6 + MGUs
## 3 Hamilton Lewis Hamilton 90 degrees V8
## 4 Hamilton Lewis Hamilton 90 degrees V8
## 5 Hamilton Lewis Hamilton 90 degrees V6 + MGUs
## 6 Hamilton Lewis Hamilton 90 degrees V8
Logic to split data into training and testing sets.
modelVariables = c('positionOrder', 'grid', 'name', 'fullname', 'era')
modelDataframe = F1DataFrame[, modelVariables]
set.seed(4)
trainingIndices = sample(c(1:dim(modelDataframe)[1]), dim(modelDataframe)[1]*0.8)
trainingDataframe = modelDataframe[trainingIndices,]
testingDataframe = modelDataframe[-trainingIndices,]The Model
mymodel<-lm(positionOrder ~ grid +
name + fullname + era, data = F1DataFrame)
mymodel<-lm(positionOrder ~ grid + name + fullname + era, data = F1DataFrame)
par(mfrow = c(1,1))
plot(mymodel)Model Summary Statistics
summary(mymodel)##
## Call:
## lm(formula = positionOrder ~ grid + name + fullname + era, data = F1DataFrame)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.006 -3.505 -1.091 2.737 19.235
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.38543 0.66271 15.671 < 2e-16 ***
## grid 0.35069 0.01151 30.466 < 2e-16 ***
## nameAlphaTauri -1.00718 0.60703 -1.659 0.097110 .
## nameAlpine F1 Team -2.01517 0.67899 -2.968 0.003006 **
## nameArrows 1.15489 0.69475 1.662 0.096486 .
## nameBAR -0.29271 0.57947 -0.505 0.613477
## nameBMW Sauber -3.29514 0.63922 -5.155 2.59e-07 ***
## nameCaterham 1.37396 0.63183 2.175 0.029687 *
## nameFerrari -3.44972 0.45384 -7.601 3.23e-14 ***
## nameForce India -1.61483 0.51394 -3.142 0.001683 **
## nameHaas F1 Team 0.35787 0.56938 0.629 0.529670
## nameHonda -0.41443 0.68954 -0.601 0.547838
## nameHRT 2.93411 0.64075 4.579 4.73e-06 ***
## nameJaguar 0.44880 0.60081 0.747 0.455086
## nameJordan 0.84923 0.57738 1.471 0.141369
## nameLotus F1 -1.71388 0.61496 -2.787 0.005331 **
## nameMarussia 1.35399 0.63763 2.123 0.033741 *
## nameMcLaren -1.58805 0.45032 -3.526 0.000423 ***
## nameMercedes -2.99824 0.51277 -5.847 5.17e-09 ***
## nameMinardi 0.40154 0.58324 0.688 0.491174
## nameOther 0.33400 0.46587 0.717 0.473434
## nameRed Bull -2.80957 0.49236 -5.706 1.19e-08 ***
## nameRenault -1.48916 0.47895 -3.109 0.001882 **
## nameSauber -0.34052 0.47023 -0.724 0.468988
## nameToro Rosso -0.25803 0.47321 -0.545 0.585583
## nameToyota -0.93973 0.54566 -1.722 0.085074 .
## nameWilliams -0.14091 0.44700 -0.315 0.752583
## fullnameCarlos Sainz -2.07692 0.64778 -3.206 0.001350 **
## fullnameDaniel Ricciardo -2.23405 0.61061 -3.659 0.000255 ***
## fullnameDavid Coulthard -1.06185 0.66215 -1.604 0.108830
## fullnameFelipe Massa -2.27072 0.59020 -3.847 0.000120 ***
## fullnameFernando Alonso -2.73761 0.57253 -4.782 1.77e-06 ***
## fullnameGiancarlo Fisichella -2.04084 0.60956 -3.348 0.000817 ***
## fullnameJarno Trulli -0.58987 0.62167 -0.949 0.342725
## fullnameJenson Button -2.98734 0.58761 -5.084 3.77e-07 ***
## fullnameKevin Magnussen -1.41945 0.71095 -1.997 0.045903 *
## fullnameKimi Räikkönen -2.19286 0.58081 -3.776 0.000161 ***
## fullnameLance Stroll -3.01532 0.68531 -4.400 1.10e-05 ***
## fullnameLewis Hamilton -4.51906 0.62100 -7.277 3.69e-13 ***
## fullnameMark Webber -1.60402 0.61988 -2.588 0.009679 **
## fullnameMax Verstappen -3.29763 0.67780 -4.865 1.16e-06 ***
## fullnameMichael Schumacher -1.67495 0.65920 -2.541 0.011074 *
## fullnameNick Heidfeld -1.71395 0.64194 -2.670 0.007599 **
## fullnameNico Hülkenberg -1.19213 0.58740 -2.029 0.042437 *
## fullnameNico Rosberg -2.62344 0.65277 -4.019 5.89e-05 ***
## fullnameOther -1.13431 0.49616 -2.286 0.022266 *
## fullnameRalf Schumacher -2.16772 0.68905 -3.146 0.001661 **
## fullnameRomain Grosjean -0.79773 0.70348 -1.134 0.256832
## fullnameRubens Barrichello -1.96696 0.63711 -3.087 0.002026 **
## fullnameSebastian Vettel -3.13173 0.58791 -5.327 1.02e-07 ***
## fullnameSergio Pérez -2.86729 0.56115 -5.110 3.29e-07 ***
## fullnameValtteri Bottas -3.52762 0.64298 -5.486 4.21e-08 ***
## era90 degrees V8 0.13583 0.15134 0.897 0.369477
## eraV10 -1.21093 0.21407 -5.657 1.59e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.943 on 9106 degrees of freedom
## Multiple R-squared: 0.3659, Adjusted R-squared: 0.3622
## F-statistic: 99.12 on 53 and 9106 DF, p-value: < 2.2e-16
mymodel<-lm(positionOrder ~ grid, data = F1DataFrame)
ggplot(data = F1DataFrame, aes(x = grid, y = positionOrder, colour = era)) +
geom_point(position = "jitter", size = .8) +
labs(title="starting vs finishing position by era",
y="finishing position",
x="starting position")
mymodel<-lm(positionOrder ~ grid, data = F1DataFrame)
par(mfrow = c(1,1))
plot(mymodel)mymodel<-lm(positionOrder ~ grid + name + fullname + era + grid:era, data = F1DataFrame)
par(mfrow = c(1,1))
plot(mymodel)summary(mymodel)##
## Call:
## lm(formula = positionOrder ~ grid + name + fullname + era + grid:era,
## data = F1DataFrame)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.305 -3.527 -1.062 2.713 19.492
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.18992 0.68222 14.936 < 2e-16 ***
## grid 0.35685 0.01675 21.306 < 2e-16 ***
## nameAlphaTauri -0.96330 0.60783 -1.585 0.113043
## nameAlpine F1 Team -2.01214 0.68050 -2.957 0.003116 **
## nameArrows 1.68047 0.70947 2.369 0.017875 *
## nameBAR -0.18805 0.58039 -0.324 0.745948
## nameBMW Sauber -3.24381 0.63884 -5.078 3.90e-07 ***
## nameCaterham 1.19126 0.63429 1.878 0.060402 .
## nameFerrari -3.45016 0.45662 -7.556 4.56e-14 ***
## nameForce India -1.61940 0.51491 -3.145 0.001666 **
## nameHaas F1 Team 0.37112 0.56909 0.652 0.514330
## nameHonda -0.41850 0.69192 -0.605 0.545302
## nameHRT 2.58153 0.65514 3.940 8.20e-05 ***
## nameJaguar 0.77193 0.60813 1.269 0.204344
## nameJordan 1.19768 0.58538 2.046 0.040788 *
## nameLotus F1 -1.64706 0.61482 -2.679 0.007399 **
## nameMarussia 1.13948 0.64156 1.776 0.075747 .
## nameMcLaren -1.58150 0.45005 -3.514 0.000443 ***
## nameMercedes -2.96328 0.52171 -5.680 1.39e-08 ***
## nameMinardi 1.17922 0.62166 1.897 0.057874 .
## nameOther 0.32732 0.46878 0.698 0.485051
## nameRed Bull -2.67785 0.49441 -5.416 6.24e-08 ***
## nameRenault -1.48690 0.47883 -3.105 0.001907 **
## nameSauber -0.28114 0.47039 -0.598 0.550072
## nameToro Rosso -0.30402 0.47383 -0.642 0.521137
## nameToyota -0.71284 0.54921 -1.298 0.194339
## nameWilliams -0.19097 0.44755 -0.427 0.669601
## fullnameCarlos Sainz -1.92574 0.64850 -2.970 0.002990 **
## fullnameDaniel Ricciardo -2.15690 0.61099 -3.530 0.000417 ***
## fullnameDavid Coulthard -1.13276 0.66200 -1.711 0.087092 .
## fullnameFelipe Massa -2.00781 0.59393 -3.381 0.000726 ***
## fullnameFernando Alonso -2.56675 0.57460 -4.467 8.03e-06 ***
## fullnameGiancarlo Fisichella -1.91949 0.60986 -3.147 0.001653 **
## fullnameJarno Trulli -0.71872 0.62250 -1.155 0.248299
## fullnameJenson Button -2.81085 0.58919 -4.771 1.86e-06 ***
## fullnameKevin Magnussen -1.31376 0.71091 -1.848 0.064634 .
## fullnameKimi Räikkönen -2.05271 0.58146 -3.530 0.000417 ***
## fullnameLance Stroll -2.88345 0.68776 -4.192 2.79e-05 ***
## fullnameLewis Hamilton -4.29182 0.62320 -6.887 6.09e-12 ***
## fullnameMark Webber -1.49148 0.62177 -2.399 0.016470 *
## fullnameMax Verstappen -3.24201 0.67972 -4.770 1.87e-06 ***
## fullnameMichael Schumacher -1.76263 0.66470 -2.652 0.008021 **
## fullnameNick Heidfeld -1.50224 0.64358 -2.334 0.019606 *
## fullnameNico Hülkenberg -1.05863 0.58787 -1.801 0.071766 .
## fullnameNico Rosberg -2.43295 0.65492 -3.715 0.000205 ***
## fullnameOther -1.04727 0.49630 -2.110 0.034871 *
## fullnameRalf Schumacher -2.27859 0.68996 -3.302 0.000962 ***
## fullnameRomain Grosjean -0.69624 0.70355 -0.990 0.322388
## fullnameRubens Barrichello -1.97576 0.63877 -3.093 0.001987 **
## fullnameSebastian Vettel -2.94648 0.58946 -4.999 5.88e-07 ***
## fullnameSergio Pérez -2.77168 0.56127 -4.938 8.02e-07 ***
## fullnameValtteri Bottas -3.38177 0.64398 -5.251 1.54e-07 ***
## era90 degrees V8 -0.19567 0.28781 -0.680 0.496616
## eraV10 -0.45452 0.35751 -1.271 0.203627
## grid:era90 degrees V8 0.03029 0.02157 1.404 0.160292
## grid:eraV10 -0.08097 0.02809 -2.882 0.003959 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.939 on 9104 degrees of freedom
## Multiple R-squared: 0.367, Adjusted R-squared: 0.3632
## F-statistic: 95.97 on 55 and 9104 DF, p-value: < 2.2e-16
F1DataFrame = F1DataFrame %>% filter(F1DataFrame$year >= "2000")
mymodel = lm(positionOrder ~ grid + name + fullname + era + grid:era, data = F1DataFrame)
simpleModel = lm(positionOrder ~ grid, data = F1DataFrame)
par(mfrow = c(1,1))
plot(mymodel)Model 3 Summary Statistics
summary(mymodel)##
## Call:
## lm(formula = positionOrder ~ grid + name + fullname + era + grid:era,
## data = F1DataFrame)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.305 -3.527 -1.062 2.713 19.492
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.18992 0.68222 14.936 < 2e-16 ***
## grid 0.35685 0.01675 21.306 < 2e-16 ***
## nameAlphaTauri -0.96330 0.60783 -1.585 0.113043
## nameAlpine F1 Team -2.01214 0.68050 -2.957 0.003116 **
## nameArrows 1.68047 0.70947 2.369 0.017875 *
## nameBAR -0.18805 0.58039 -0.324 0.745948
## nameBMW Sauber -3.24381 0.63884 -5.078 3.90e-07 ***
## nameCaterham 1.19126 0.63429 1.878 0.060402 .
## nameFerrari -3.45016 0.45662 -7.556 4.56e-14 ***
## nameForce India -1.61940 0.51491 -3.145 0.001666 **
## nameHaas F1 Team 0.37112 0.56909 0.652 0.514330
## nameHonda -0.41850 0.69192 -0.605 0.545302
## nameHRT 2.58153 0.65514 3.940 8.20e-05 ***
## nameJaguar 0.77193 0.60813 1.269 0.204344
## nameJordan 1.19768 0.58538 2.046 0.040788 *
## nameLotus F1 -1.64706 0.61482 -2.679 0.007399 **
## nameMarussia 1.13948 0.64156 1.776 0.075747 .
## nameMcLaren -1.58150 0.45005 -3.514 0.000443 ***
## nameMercedes -2.96328 0.52171 -5.680 1.39e-08 ***
## nameMinardi 1.17922 0.62166 1.897 0.057874 .
## nameOther 0.32732 0.46878 0.698 0.485051
## nameRed Bull -2.67785 0.49441 -5.416 6.24e-08 ***
## nameRenault -1.48690 0.47883 -3.105 0.001907 **
## nameSauber -0.28114 0.47039 -0.598 0.550072
## nameToro Rosso -0.30402 0.47383 -0.642 0.521137
## nameToyota -0.71284 0.54921 -1.298 0.194339
## nameWilliams -0.19097 0.44755 -0.427 0.669601
## fullnameCarlos Sainz -1.92574 0.64850 -2.970 0.002990 **
## fullnameDaniel Ricciardo -2.15690 0.61099 -3.530 0.000417 ***
## fullnameDavid Coulthard -1.13276 0.66200 -1.711 0.087092 .
## fullnameFelipe Massa -2.00781 0.59393 -3.381 0.000726 ***
## fullnameFernando Alonso -2.56675 0.57460 -4.467 8.03e-06 ***
## fullnameGiancarlo Fisichella -1.91949 0.60986 -3.147 0.001653 **
## fullnameJarno Trulli -0.71872 0.62250 -1.155 0.248299
## fullnameJenson Button -2.81085 0.58919 -4.771 1.86e-06 ***
## fullnameKevin Magnussen -1.31376 0.71091 -1.848 0.064634 .
## fullnameKimi Räikkönen -2.05271 0.58146 -3.530 0.000417 ***
## fullnameLance Stroll -2.88345 0.68776 -4.192 2.79e-05 ***
## fullnameLewis Hamilton -4.29182 0.62320 -6.887 6.09e-12 ***
## fullnameMark Webber -1.49148 0.62177 -2.399 0.016470 *
## fullnameMax Verstappen -3.24201 0.67972 -4.770 1.87e-06 ***
## fullnameMichael Schumacher -1.76263 0.66470 -2.652 0.008021 **
## fullnameNick Heidfeld -1.50224 0.64358 -2.334 0.019606 *
## fullnameNico Hülkenberg -1.05863 0.58787 -1.801 0.071766 .
## fullnameNico Rosberg -2.43295 0.65492 -3.715 0.000205 ***
## fullnameOther -1.04727 0.49630 -2.110 0.034871 *
## fullnameRalf Schumacher -2.27859 0.68996 -3.302 0.000962 ***
## fullnameRomain Grosjean -0.69624 0.70355 -0.990 0.322388
## fullnameRubens Barrichello -1.97576 0.63877 -3.093 0.001987 **
## fullnameSebastian Vettel -2.94648 0.58946 -4.999 5.88e-07 ***
## fullnameSergio Pérez -2.77168 0.56127 -4.938 8.02e-07 ***
## fullnameValtteri Bottas -3.38177 0.64398 -5.251 1.54e-07 ***
## era90 degrees V8 -0.19567 0.28781 -0.680 0.496616
## eraV10 -0.45452 0.35751 -1.271 0.203627
## grid:era90 degrees V8 0.03029 0.02157 1.404 0.160292
## grid:eraV10 -0.08097 0.02809 -2.882 0.003959 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.939 on 9104 degrees of freedom
## Multiple R-squared: 0.367, Adjusted R-squared: 0.3632
## F-statistic: 95.97 on 55 and 9104 DF, p-value: < 2.2e-16
summary(simpleModel)##
## Call:
## lm(formula = positionOrder ~ grid, data = F1DataFrame)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.289 -3.824 -1.086 3.015 18.930
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.070542 0.109514 46.30 <2e-16 ***
## grid 0.550774 0.008669 63.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.157 on 9158 degrees of freedom
## Multiple R-squared: 0.3059, Adjusted R-squared: 0.3058
## F-statistic: 4036 on 1 and 9158 DF, p-value: < 2.2e-16
V10plot = F1DataFrame %>% filter(F1DataFrame$era == "V10") %>% ggplot(aes(x = grid, y = positionOrder, colour = era)) +
geom_point(position = "jitter", size = .8) +
labs(title="starting vs finishing position by era",
y="finishing position",
x="starting position")
V8plot = F1DataFrame %>% filter(F1DataFrame$era == "90 degrees V8") %>% ggplot(aes(x = grid, y = positionOrder, colour = era)) +
geom_point(position = "jitter", size = .8) +
labs(title="starting vs finishing position by era",
y="finishing position",
x="starting position")
V8MGUsplot = F1DataFrame %>% filter(F1DataFrame$era == "90 degrees V6 + MGUs") %>% ggplot(aes(x = grid, y = positionOrder, colour = era)) +
geom_point(position = "jitter", size = .8) +
labs(title="starting vs finishing position by era",
y="finishing position",
x="starting position")
V10plotV8plotV8MGUsplotset.seed(1234)
fitControl<-trainControl(method="repeatedcv",number=10,repeats=1)
knn.fit<-train(positionOrder~.,
data=modelDataframe,
method="knn",
trControl=fitControl,
tuneGrid=expand.grid(k=c(1:10,20,30))
)
knn.fit## k-Nearest Neighbors
##
## 9160 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 8244, 8244, 8244, 8244, 8243, 8244, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 5.532697 0.2637072 4.210911
## 2 5.300432 0.2979288 4.086467
## 3 5.198577 0.3150419 4.022011
## 4 5.146384 0.3242655 3.991361
## 5 5.098006 0.3338238 3.959075
## 6 5.060262 0.3413419 3.934518
## 7 5.042913 0.3446800 3.925552
## 8 5.032698 0.3466881 3.916071
## 9 5.021445 0.3489045 3.911490
## 10 5.008533 0.3515009 3.906285
## 20 4.981941 0.3562147 3.897474
## 30 4.969751 0.3587330 3.898560
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 30.
# THANK YOU!